Stochastic Approximation for Non-Expansive Maps: Application to Q-Learning Algorithms
Authors
Abstract
We discuss synchronous and asynchronous variants of fixed point iterations of the form $x_{k+1} = x_k + \gamma(k)\,(F(x_k, \xi_k) - x_k)$, where $F$ is a non-expansive mapping under a suitable norm and $\{\xi_k\}$ is a stochastic sequence. These are stochastic approximation iterations that can be analyzed using the ODE approach, based either on Kushner and Clark's lemma for the synchronous case or on Borkar's theorem for the asynchronous case. However, the analysis requires that the iterates $\{x_k\}$ be bounded, a fact which is usually hard to prove. We develop a novel framework for proving boundedness, based on scaling ideas and properties of Lyapunov functions. We then combine the boundedness property with Borkar's stability analysis of ODEs involving non-expansive mappings to prove convergence with probability 1. We also apply our convergence analysis to Q-learning algorithms for stochastic shortest path problems, and we are able to relax some of the assumptions of the currently available results.
1. Research supported by NSF under Grant 9600494-DMI. Thanks are due to John Tsitsiklis, whose suggestions resulted in important simplifications of the lemmas in Section 2.
2. Dept. of Electrical Engineering and Computer Science, M.I.T., Cambridge, Mass., 02139.
3. School of Technology and Computer Science, Tata Institute of Fundamental Research, Homi Bhabha Road, Mumbai 400005, India.
4. The research of V. Borkar was supported in part by the Homi Bhabha Fellowship and Govt. of India, Dept. of Science and Technology grant No. III 5(12)/96-ET.
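To make the setting concrete, here is a minimal Python sketch of asynchronous Q-learning for a stochastic shortest path problem, written in the form $x_{k+1} = x_k + \gamma(k)(F(x_k, \xi_k) - x_k)$, where $F$ is the sampled dynamic programming operator (non-expansive in the sup-norm). The toy MDP, its transition kernel `P`, costs `g`, and the 1/visit-count step sizes are illustrative assumptions of ours, not the paper's construction.

```python
import numpy as np

# Minimal sketch: asynchronous Q-learning for a toy stochastic shortest path
# (SSP) problem, cast as x_{k+1} = x_k + gamma(k) * (F(x_k, xi_k) - x_k).
# The MDP below (transition kernel P, costs g) is hypothetical and only
# serves to make the recursion runnable.

n_states, n_actions = 4, 2              # state n_states - 1 is the absorbing goal
rng = np.random.default_rng(0)

P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))  # P[a, i, :] = next-state dist.
g = rng.uniform(1.0, 2.0, size=(n_states, n_actions))             # positive one-stage costs
P[:, -1, :] = 0.0
P[:, -1, -1] = 1.0                       # goal state is absorbing ...
g[-1, :] = 0.0                           # ... and cost-free

Q = np.zeros((n_states, n_actions))
visits = np.zeros((n_states, n_actions))

for k in range(200_000):
    i = rng.integers(n_states - 1)       # asynchronous: update one component (i, a) per step
    a = rng.integers(n_actions)
    j = rng.choice(n_states, p=P[a, i])  # sampled next state, the noise xi_k
    visits[i, a] += 1
    gamma = 1.0 / visits[i, a]           # step sizes: sum gamma = inf, sum gamma^2 < inf
    target = g[i, a] + Q[j].min()        # noisy evaluation F(Q, xi_k) of the SSP DP operator
    Q[i, a] += gamma * (target - Q[i, a])

print(Q)                                 # approximates the optimal Q-values of the toy SSP
```

Boundedness of such iterates, which the paper establishes through its scaling and Lyapunov-function argument, is what licenses the ODE-based convergence analysis for this kind of recursion.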
Similar References
Stochastic approximation for non-expansive maps: application to Q-learning algorithms
We discuss synchronous and asynchronous iterations of the form $x_{k+1} = x_k + \gamma(k)(h(x_k) + w_k)$, where $h$ is a suitable map and $\{w_k\}$ is a deterministic or stochastic sequence satisfying suitable conditions. In particular, in the stochastic case, these are stochastic approximation iterations that can be analyzed using the ODE approach based either on Kushner and Clark's lemma for the synchronous case or on...
Two-Timescale Q-Learning with an Application to Routing in Communication Networks
We propose two variants of the Q-learning algorithm that (both) use two timescales. One of these updates Q-values of all feasible state-action pairs at each instant while the other updates Q-values of states with actions chosen according to the ‘current’ randomized policy updates. A sketch of convergence of the algorithms is shown. Finally, numerical experiments using the proposed algorithms fo...
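For intuition about the two-timescale structure mentioned in this abstract, the following is a generic coupled-recursion sketch in Python. It is purely illustrative (not this paper's routing algorithm): the specific step-size exponents and the linear update maps are assumptions of ours. The point it demonstrates is that when the slow step sizes are asymptotically negligible relative to the fast ones, the fast iterate tracks its equilibrium for the quasi-statically varying slow iterate.

```python
import numpy as np

# Generic two-timescale stochastic approximation sketch (illustrative only).
# Fast step sizes a(k) and slow step sizes b(k) satisfy b(k)/a(k) -> 0, so the
# fast iterate x effectively sees the slow iterate y as frozen.

rng = np.random.default_rng(1)
x, y = 0.0, 0.0

for k in range(1, 200_000):
    a_k = 1.0 / k ** 0.6                 # fast timescale
    b_k = 1.0 / k                        # slow timescale: b_k / a_k -> 0
    # fast recursion: x tracks the fixed point of its update for frozen y, here simply y
    x += a_k * (y - x + 0.1 * rng.standard_normal())
    # slow recursion: y is driven toward the point where the tracked value x equals 1
    y += b_k * (1.0 - x + 0.1 * rng.standard_normal())

print(x, y)                              # both iterates settle near 1.0
```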
Empirical Q-Value Iteration
We propose a new simple and natural algorithm for learning the optimal Q-value function of a discounted-cost Markov Decision Process (MDP) when the transition kernels are unknown. Unlike the classical learning algorithms for MDPs, such as Q-learning and ‘actor-critic’ algorithms, this algorithm doesn’t depend on a stochastic approximation-based method. We show that our algorithm, which we call ...
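As a rough illustration of the idea described above, one sweep of an "empirical" Bellman update for a discounted-cost MDP might look as follows. This is a sketch under our own assumptions, not the cited paper's exact algorithm: the simulator interface `sample_next`, the per-pair sample count `m`, and the discount factor `alpha` are hypothetical names introduced here.

```python
import numpy as np

def empirical_q_iteration_sweep(Q, g, sample_next, alpha, m, rng):
    """One sweep of an empirical Q-value iteration (illustrative sketch).

    The expectation in the discounted-cost Bellman operator
        (TQ)(i, a) = g(i, a) + alpha * E[ min_b Q(j, b) ]
    is replaced by an average over m freshly drawn next-state samples, and the
    resulting empirical operator is applied to every state-action pair at once,
    rather than via an incremental stochastic-approximation update.
    `sample_next(i, a, rng)` is a hypothetical simulator returning one next state.
    """
    n_states, n_actions = Q.shape
    Q_new = np.empty_like(Q)
    for i in range(n_states):
        for a in range(n_actions):
            samples = [sample_next(i, a, rng) for _ in range(m)]
            Q_new[i, a] = g[i, a] + alpha * np.mean([Q[j].min() for j in samples])
    return Q_new
```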
LIDS-P-2172: Asynchronous Stochastic Approximation and Q-Learning
We provide some general results on the convergence of a class of stochastic approximation algorithms and their parallel and asynchronous variants. We then use these results to study the Q-learning algorithm, a reinforcement learning method for solving Markov decision problems, and establish its convergence under conditions more general than previously available.
New algorithms of the Q-learning type
We propose two algorithms for Q-learning that use the two timescale stochastic approximation methodology. The first of these updates Q-values of all feasible state-action pairs at each instant while the second updates Q-values of states with actions chosen according to the ‘current’ randomized policy updates. A proof of convergence of the algorithms is shown. Finally, numerical experiments usin...